Issues in building English-Chinese parallel corpora with WordNets
Abstract
We discuss some of the issues in producing sense-tagged parallel corpora, including pre-processing, adding new entries, and linking. We have preliminary results for three genres (stories, essays, and tourism web pages) in both Chinese and English.
Similar resources
Developing Parallel Sense-tagged Corpora with Wordnets
Semantically annotated corpora play an important role in natural language processing. This paper presents the results of a pilot study on building a sense-tagged parallel corpus, part of ongoing construction of aligned corpora for four languages (English, Chinese, Japanese, and Indonesian) in four domains (story, essay, news, and tourism) from the NTU-Multilingual Corpus. Each subcorpus is firs...
Chinese-English Parallel Corpus Construction and its Application
Chinese-English parallel corpora are key resources for Chinese-English cross-language information processing, Chinese-English bilingual lexicography, and Chinese-English language research and teaching. However, no large-scale Chinese-English corpus is available yet, given the difficulties and the intensive labour required. In this paper, our work towards building a large-scale Chinese-Engli...
Using Multilingual Resources for Building SloWNet Faster
This project report presents the results of an approach in which synsets for the Slovene wordnet were induced automatically from parallel corpora and already existing wordnets. First, multilingual lexicons were obtained from word-aligned corpora and compared to the wordnets in various languages in order to disambiguate lexicon entries. Then appropriate synset ids were attached to Slovene entries fr...
Automatic creation of WordNets from parallel corpora
In this paper we present the evaluation results for the creation of WordNets for five languages (Spanish, French, German, Italian and Portuguese) using an approach based on parallel corpora. We have used three very large parallel corpora for our experiments: DGT-TM, EMEA and ECB. The English part of each corpus is semantically tagged using Freeling and UKB. After this step, the process of WordN...
Bootstrapping a Portuguese WordNet from Galician, Spanish and English Wordnets
In this article we exploit the possibility of bootstrapping a European Portuguese WordNet from the English, Spanish and Galician wordnets using Probabilistic Translation Dictionaries automatically created from parallel corpora. The process generated a total of 56 770 synsets and 97 058 variants. An evaluation of the results using the Brazilian OpenWordNet-PT as a gold standard resulted in a pr...